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Professors Candes and Tao are to be congratulated for their innovative 
and valuable contribution to high-dimensional sparse recovery and model 
selection. The analysis of vast data sets now commonly arising in scientific 
investigations poses many statistical challenges not present in smaller scale 
studies. Many of these data sets exhibit sparsity where most of the data 
corresponds to noise and only a small fraction is of interest. The needs 
of this research have excited much interest in the statistical community. 
In particular, high-dimensional model selection has attracted much recent 
attention and has become a central topic in statistics. The main difficulty 
of such a problem comes from collinearity between the predictor variables. 
It is clear from the geometric point of view that the collinearity increases as 
the dimensionality grows. 

A common approach taken in the statistics literature is the penalized 
likelihood, for example, Lasso (Tibshirani [11]) and adaptive Lasso (Zou 
[12]), SCAD (Fan and Li [7] and Fan and Peng [9]) and nonnegative garrote 
(Breiman [1]). Commonly used algorithms include LARS (Efron, Hastie, 
Johnstone and Tibshirani [6]), LQA (Fan and Li [7]) and MM (Hunter and 
Li [10]). In the present paper, Candes and Tao take a new approach, called 
the Dantzig selector, which uses £i-minimization with regularization on the 
residuals. One promising fact is that the Dantzig selector solves a linear 
program, usually faster than the existing methods. In addition, the authors 
establish that, under the Uniform Uncertainty Principle (UUP), with large 
probability the Dantzig selector mimics the risk of the oracle estimator up 
to a logarithmic factor logp, where p denotes the number of variables. 

We appreciate the opportunity to comment on several aspects of this 
article. Our discussion here will focus on four issues: (1) connection to sparse 
signal recovery in the noiseless case; (2) the UUP condition and identifiability 
of the model; (3) computation and model selection; (4) minimax rate. 
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1. Sparse signal recovery. The "large p, small n" regression problem 
considered in this paper can be viewed as a generalization of the classical 
linear algebra problem in which one wishes to solve the linear equation 



where X is a given nx p matrix and y is a vector in M n . Because p> n, the 
linear equation (1) is underdetermined and there are an infinite number of 
solutions to the equation. The goal is to find the "sparsest" solution under 
certain regularity conditions. In this noiseless setting, the Dantzig selector 
reduces to an l\ -minimization over the space of all representations of the 
signal: 

(2) Minimize ||/?||i subject to y = X(5. 

This idea of finding a sparse representation using l\ -minimization has been 
used in Donoho and Elad [4]. The authors have also used this approach in 
their earlier work (Candes and Tao [2, 3]) on recovering sparse signals in 
the noiseless case. When adding a Gaussian noise term e to (1), the linear 
algebra problem becomes a nonstandard linear regression problem because 
p^> n. In the classical linear regression problem when p<n the least squares 
estimator is the solution to the normal equation 



The constraint ||X T r|| 00 < X p a (which is the same as ||A" T y — X^^XPW^ < 
Xpd) in the convex program (DS) in this paper can be viewed as a relax- 
ation of the normal equation (3). And similarly to the noiseless case l\- 
minimization leads to the "sparsest" solution over the space of all feasible 
solutions. 

The authors suggest using X p = ^/2\ogp, which is equal to y / 2logn in 
the orthogonal design setting. In this setting, the oracle properties of the 
Dantzig selector are in line with those shrinkage results in Donoho and 
Johnstone [5] which are shown to be optimal in the minimax sense. When 
p S> n, it might be possible that the regularization factor X p = ^/2Togp in the 
Dantzig selector overshrinks the p- vector (3 and underestimates the nonzero 
coordinates. It would be interesting to find the "optimal" regularization 
factor. 

2. The UUP condition and identifiability of the model. A nice idea in 
this paper is the use of the UUP condition. The UUP condition has also 
been used by the authors in their earlier work (Candes and Tao [2, 3]) in the 
noiseless setting. The UUP condition roughly says that for any small set of 
predictors, these n-vectors are nearly orthogonal to each other. The authors 
give an interpretation of the UUP condition in terms of the model identifia- 
bility and have established oracle inequalities for the Dantzig selector under 



(1) 



y = X(3 



(3) 



X T y = X T Xp. 
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Fig. 1. Distribution of the maximum absolute sample correlation when n = 60, p n = 1000 
(solid curve) and n = 60, p n = 5000 (dashed curve). 



the UUP using geometric arguments. However, we still have some concerns 
about this condition. 

First, it is computationally unrealistic to verify whether the UUP condi- 
tion holds for a given design matrix X when p is large and the number of 
signals S is not too small. Note that computing the S*-restricted isometry 
constant 5s of the design matrix X is over the space of all 5-subsets of 
{1,. . . ,p}, which is of cardinality (<0. This combinatorial complexity makes 
it infeasible to check the UUP condition for reasonable values of p and S, say 
p = 1000 and S = 5. So it is interesting to look for other checkable conditions 
that are compatible with the model identifiability condition in a similar or 
new setup. 

Second, as discussed in Fan and Lv [8], the UUP is hard to satisfy when 
the dimension p grows rapidly with the sample size n. This is essentially due 
to significant sample correlation, that is, strong collinearity, between the pre- 
dictor variables in the high-dimensional setting of p 3> n. For instance, we 
take p independent predictors X\, . . . , X p from the standard Gaussian distri- 
bution and compute the maximum of the pairwise absolute sample correla- 
tions from an n x p design matrix X . Figure 1, which is extracted from Fan 
and Lv [8], shows the distributions of the maximum correlation with n = 60, 
p = 1000 and n = 60, p = 5000, respectively. The maximum sample corre- 
lation between predictors can be very large and close to 1. Moreover, the 
maximum of the first canonical correlations between two groups of predic- 
tors, for example, three predictors in one group and five in another, can be 
much larger since there are (^)( p ~^ 3 ) = 0(p 8 ) (3> n) choices in this example. 

3. Computation and model selection. Due to its nature of involving lin- 
ear programming, the Dantzig selector can be solved quickly and efficiently 
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by a primal-dual interior point algorithm when the dimension is not ul- 
trahigh, for example, in the thousands. It is usually faster to implement 
than other existing methods such as Lasso. However, in problems of large 
or ultralarge scale the computational cost of implementing linear programs 
is still a potential hurdle. For example, in the analysis of microarray gene 
expression or proteomics data, it is common to have dimension of tens of 
thousands and implementing linear programs in such settings can still be 
computationally challenging. Therefore, it is interesting and necessary to 
study ultrahigh-dimensional model selection. Recently, Fan and Lv [8] in- 
troduced a procedure for screening variables via iteratively thresholded ridge 
regression and proposed a new method of dimension reduction called Sure In- 
dependence Screening, for ultrahigh-dimensional feature space. The method 
can improve the estimation accuracy significantly while speeding up variable 
selection drastically. 

In our experience, we have found that the algorithm solving the Dantzig 
selector is sensitive to the initial value. Trivial initial values such as constant 
vectors usually do not work well. The generalized least squares estimator 
(X T X)~X T y can be used as an initial value. However, the solution is usually 
nonsparse in our experience. 

As mentioned earlier, the regularization factor \ p = \/2logp in the Dantzig 
selector leads to relatively large bias in estimating the sparse regression co- 
efficient vector (3. To reduce the bias, the authors suggest a two-stage pro- 
cedure called the Gauss-Dantzig selector which uses the original Dantzig 
selector for variable selection and then runs ordinary least squares on the 
selected variables. It would be interesting to know the theoretical and nu- 
merical properties of this and other variations. 

4. Minimax rate. It is appealing that the Dantzig selector achieves within 
a logarithmic factor logp of the ideal risk. We are curious about the opti- 
mality of this factor. It is unclear at this point whether the minimax factor 
should be logp or logra or some other quantity. When p is polynomial in 
the sample size n, logp and logn are of the same order and the difference 
between the two is not significant. However, when p is exponential in n, 
say p = e n for some a > 0, then logp = n a becomes large, much larger than 
logn. It is of theoretical and practical interest to study the minimax behav- 
ior of the problem. If the minimax rate is logp, then the Dantzig selector is 
a rate optimal minimax procedure. Otherwise, it is interesting to construct 
a procedure that can attain the minimax rate. 

5. Concluding remarks. i\ -regularization in terms of linear program- 
ming provides a new framework for model selection and is proven effective 
in solving high-dimensional problems. The Dantzig selector provides us new 
insight on high-dimensional model selection. Clearly, there is much work 
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ahead of us. This paper opens a new area and will definitely stimulate new 
ideas in the future. We thank the authors for their clear and imaginative 
work. 
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